Introduction

According to the American Time Use Survey (ATUS, https://www.bls.gov/charts/american-time-use/activity-by-work.htm), 85.8 percent of males and 66.5 percent of females work more than 40 hours per week in the U.S. Working overtime has been associated with diverse health problems, especially ischemic heart diseases and cerebrovascular diseases (Nishiyama & Johnson, 1997). Therefore, this project aims to examine whether working overtime is associated with the blood pressure level using the National Health and Nutrition Examination Survey (NHANES) data. In order to produce sufficient estimation sample size, this study uses a pooled data from the three two-year cycle data sets: 2011-2012, 2013-2014, and 2015-2016.


It is well known that various biological and environmental factors influence the blood pressure level (Lauer, 1993; Papathanasiou, 2015). To control for the potential influences of such factors, we have systematically reviewed the literature to select the important factors that might affect the blood pressure levels of the respondents. Final predictors, other that working hours, includes in the model were gender, age, body mass index (BMI), alcohol consumption level, hours of sleep, and history of smoking. These final predictors closely match with those from other published studies that observed the association between work hours and blood pressure levels (Hayashi, 1996; Nakamura et al. 2012).


In terms of defining the criteria for the key variables of interest, working overtime is defined as working more than 40 hours per week. For the additional logistic regression analyses, abnormal blood pressure is defined as having systolic blood pressure higher than 140mm Hg and diastolic blood pressure higher than 90mm Hg following the guide from the American Heart Association.


Data Selction and Cleaning

The study population are participants, aged 18 - 60 and employed, of The National Health and Nutrition Examination Survey (NHANES) program in three two-years cycles: 2011-2012, 2013-2014 and 2015-2016. Among them, 2011-2012 cycle consists of 2,495 sample size, 2013-2014 cycle contains 2,819 and 2015-2016 cycle includes 2,799. For the purpose of this study, we exclude people who are now taking prescription due to hypertension. The exclusion leads to huge amount of reduction to our sample size: there are 100 observations left for 2011-2012, 95 for 2013-2014 and 140 for 2015-2016.


Among 335 total sample sizes, we pick variables through literature reviews and group discussion. Dataset varies for each two-years cycle by the suffix in the name. 2011-2012 is G, 2013-2014 is H, and 2015-2016 is I. For the convenience, the suffix will be replaced by *. Variables chosen by this project are listed below:


The measurements of blood pressure: Systolic blood pressure and Diastolic blood pressure from BPX_* dataset. Each blood pressure is measured in 3 trials, sometimes 4 trials,


Working hours in the last week from OCQ_* dataset,


Body mass index from BMX_* dataset,


Demographic data: age and gender from DEMO_* dataset,


Alchol used in the last 12 months: ALQ120Q for the quantity and ALQ120U to differentiate unit of week, month or year, from ALQ_* dataset,


Smoked tobacco in last five days: SMQ681 in 2013-2016, and SMQ680 in 2011-2012, from SMQRTU_* dataset,


Sleeping hours in working days: SLD012 in 2015-2016, and SLD010H for others from SLQ_* dataset.


The cleaning process involves excluding missing values for complete case analysis, and excludig cases where there are more than 2 trials blood pressure measurements missing. After the cleaning process, the sample size for this study is 238 in total.

Method

This study implements multiple regression on each type (systolic and diatolic) blood pressure against working hours controlled by all other variables to assess if working overtimes would cause higher blood pressure. Coefficients and their p-values on statistic significance are reported.


The response variable blood pressure is the average value of blood pressures measured in 3 trials (sometimes 4 if one previous trial is missing); the variable working hours is encoded into dichotomy variable where 1 stands for working more than 40 hours and 0 otherwise; alchol drinks are converted to average drinks in a week, so values with unit month are divided by 4.345 and by 52.143 for values with unit year. 3 digits are preserved; due to the difference of encoding of sleep variable in 2015-2016 (it has 14 meaning more 14 hours, but this limit is 12 in 2011-2014), we recode all sleep hours that are more than 12 to 12 to keep the consistency in sample dataset.


The additional analysis involves adding interactions, transformations to our classical linear model. Logistics regression is applied to examine the associations between the predictors and the risk of having abnormal blood pressure.


We use data.table package, dplyr package and python for the data preparation process. The multiple regression is implemented in python, lm function and gls function in R.

Core Analysis and Results

lm and R_data.table

After data cleaning, we visualized our data with the matrix plots first. As shown in the matrix plots for the systolic blood pressure, we found that there were weak correlations between the variables. As for diastolic blood pressure, the matrix plots show that there are weak relations or almost no relations between the variables. Moreover, from the histograms, we found that some variables were not normally distributed; however, most variables, including systolic and diastolic blood pressure, were normally distributed.

pairs( ~avg_sys_bp + gender + age + bmi + sleep + smoke + workhrs +
         alchol, data = analysis_dt, main = "simple matrix plot for Systolic blood pressure")

pairs( ~avg_dia_bp + gender + age + bmi + sleep + smoke + workhrs +
         alchol, data = analysis_dt, main = "simple matrix plot for Diastolic blood pressure")

par( mfrow=c(3,2) )
hist(analysis_dt$avg_sys_bp)
hist(analysis_dt$avg_dia_bp)
hist(analysis_dt$age)
hist(analysis_dt$bmi)
hist(analysis_dt$sleep)
hist(analysis_dt$alchol)

After the preliminary analysis on the data, we fitted our linear regression models.


For the systolic blood pressure, we fitted the following model:

\[ \begin{aligned} y &= \beta_{0} + \beta_{workhrs}*x_{workhrs} + \beta_{gender}*x_{gender} + \beta_{age}*x_{age} + \beta_{bmi}*x_{bmi} \notag\\ &+ \beta_{sleep}*x_{sleep} + \beta_{smoke}*x_{smoke} + \beta_{avg\_alcohol\_freq\_wk}*x_{avg\_alcohol\_freq\_wk} + \epsilon \end{aligned} \]

For diastolic blood pressure, we fitted the following model:
\[ \begin{aligned} y &= \beta_{0} + \beta_{workhrs}*x_{workhrs} + \beta_{gender}*x_{gender} + \beta_{age}*x_{age} + \beta_{bmi}*x_{bmi} \notag\\ &+ \beta_{sleep}*x_{sleep} + \beta_{smoke}*x_{smoke} + \beta_{avg\_alcohol\_freq\_wk}*x_{avg\_alcohol\_freq\_wk} + \epsilon \end{aligned} \]

## systolic bp
pre_sys_fit = lm(avg_sys_bp ~ gender + age + bmi + sleep + smoke + workhrs + alchol,
             data = analysis_dt)

summary(pre_sys_fit)
## 
## Call:
## lm(formula = avg_sys_bp ~ gender + age + bmi + sleep + smoke + 
##     workhrs + alchol, data = analysis_dt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -42.867 -12.271  -2.392  10.500  76.693 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 90.29276   11.82799   7.634 6.05e-13 ***
## gender      -3.62489    2.74377  -1.321 0.187770    
## age          0.47042    0.12723   3.697 0.000273 ***
## bmi          0.64716    0.20248   3.196 0.001588 ** 
## sleep        0.49185    0.98005   0.502 0.616243    
## smoke        1.19416    2.65097   0.450 0.652802    
## workhrs      1.57530    2.67687   0.588 0.556784    
## alchol       0.01928    0.71959   0.027 0.978645    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.21 on 230 degrees of freedom
## Multiple R-squared:  0.08855,    Adjusted R-squared:  0.06081 
## F-statistic: 3.192 on 7 and 230 DF,  p-value: 0.003031
## diatolic bp
pre_dia_fit = lm(avg_dia_bp ~ gender + age + sleep + smoke + workhrs + alchol,
                 data = analysis_dt)
 
summary(pre_dia_fit)
## 
## Call:
## lm(formula = avg_dia_bp ~ gender + age + sleep + smoke + workhrs + 
##     alchol, data = analysis_dt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -54.710  -8.225   0.728   8.255  33.789 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 69.90643    6.50533  10.746   <2e-16 ***
## gender      -2.77089    1.93383  -1.433   0.1533    
## age          0.22960    0.09145   2.511   0.0127 *  
## sleep        0.07160    0.70738   0.101   0.9195    
## smoke       -0.04585    1.92066  -0.024   0.9810    
## workhrs      0.50563    1.94090   0.261   0.7947    
## alchol      -0.12419    0.51884  -0.239   0.8110    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.94 on 231 degrees of freedom
## Multiple R-squared:  0.03657,    Adjusted R-squared:  0.01154 
## F-statistic: 1.461 on 6 and 231 DF,  p-value: 0.1924

From the summary results of the linear regression model for the systolic blood pressure, we found the coefficient of the working hours is positive, which implies that working overtime increases systolic blood pressure level; however, this effect is not significant (\(\beta = 1.58\), p = 0.557). As a result, we can conclude that working overtime is not significantly correlated to systolic blood pressure. Similarly, from the summary results of the linear regression model for diastolic blood pressure, we found that the coefficient of the working hours is positive, which implies that working overtime increases diastolic blood pressure level; however, this effect was not significant as well (\(\beta = 0.2246\), p = 0.905). As a result, we conclude that working overtime is not significantly correlated to diastolic blood pressure.


After fitting the linear regression model, we checked whether the following assumptions of the linear regression model are held:


1. Homoscedasticity: The variance of the residual is constant. 2. Linearity: The relationship between the independent and dependent variables is linear. 3. No or little multicollinearity: There is no or little collinearity between independent variables.


1. Homoscedasticity


In order to check the homoscedasticity assumption, we plotted standardized residuals against fitted values plots for both systolic blood pressure and diastolic blood pressure.

par( mfrow=c(1,2) )
## standardized residuals plot for systolic bp
st_sys_red = rstandard(pre_sys_fit)
fitted_sys = pre_sys_fit$fitted.values

plot(fitted_sys, st_sys_red, xlab = "fitted values", 
     ylab = "standardized residuals", 
     main = "standardized residuals plot (systolic bp)")

## standardized residuals plot for diatolic bp
st_dia_red = rstandard(pre_dia_fit)
fitted_dia = pre_sys_fit$fitted.values

plot(fitted_dia, st_dia_red, xlab = "fitted values", 
     ylab = "standardized residuals", 
     main = "standardized residuals plot (diatolic bp)")

From the standardized residuals plot of the systolic blood pressure, we could find the mean of the residual was roughly 0, and the variance was roughly constant. Also, we had the same findings for diastolic blood pressure. Therefore, the assumption of homoscedasticity could be considered as satisfied.


2. No or little multicollinearity


In order to check the multicollinearity, we computed the Pearson correlations between each two continuous independent variables and conducted Chi-squared correlation test between each two binary independent variables.

cov_mat = vcov(pre_sys_fit)
cov_mat[-1, -1] 
##              gender          age         bmi        sleep       smoke
## gender   7.52828033 -0.022099117 -0.13212674 -0.280415922  0.43262585
## age     -0.02209912  0.016188377  0.00353704  0.007186922  0.04133382
## bmi     -0.13212674  0.003537040  0.04099638  0.020411778  0.02974110
## sleep   -0.28041592  0.007186922  0.02041178  0.960493109  0.07954100
## smoke    0.43262585  0.041333816  0.02974110  0.079541003  7.02762846
## workhrs  0.78274008 -0.024336217 -0.02142290 -0.109445473  0.80275098
## alchol   0.47160755 -0.011946925  0.01639422 -0.047810983 -0.34321186
##             workhrs      alchol
## gender   0.78274008  0.47160755
## age     -0.02433622 -0.01194692
## bmi     -0.02142290  0.01639422
## sleep   -0.10944547 -0.04781098
## smoke    0.80275098 -0.34321186
## workhrs  7.16563783  0.05594737
## alchol   0.05594737  0.51781435
## check correlation between indicators
gs = chisq.test(analysis_dt$gender, analysis_dt$smoke)
gw = chisq.test(analysis_dt$gender, analysis_dt$workhrs)
sw = chisq.test(analysis_dt$smoke, analysis_dt$workhrs)

## make tables
row = c("gender", "gender", "smoke")
col = c("smoke", "workhrs", "workhrs")
chi_sq_stats = c( gs$statistic, gw$statistic, sw$statistic )
p_value = c(gs$p.value, gw$p.value, sw$p.value)

m = as.data.table(cbind(row, col, chi_sq_stats, p_value))
m
##       row     col     chi_sq_stats            p_value
## 1: gender   smoke 3.15347205364872 0.0757655899731777
## 2: gender workhrs 1.10585776488131  0.292984154257689
## 3:  smoke workhrs 2.92869646016623 0.0870177267808233

As shown in the correlation table between each two continuous variables, we could find that there are very small or almost no correlations between the continuous independent variables. Also, as the table of the Pearson chi-squared correlation test results showed, we could find that there is little collinearity between each pair of the binary variables.


3. Linearity


In order to check the linearity assumption, we plotted partial regression plots for both systolic blood pressure and diastolic pressure against each of the independent variables.

avPlots(pre_sys_fit)

avPlots(pre_dia_fit)

From the partial regression plots of the systolic blood pressure, we could find that for each of the independent variables, the expected value of the dependent variable (systolic blood pressure) was indeed a linear function of the independent variable, controlled by all the other variables. We can get the same conclusion from the partial regression plots of diastolic blood pressure. Therefore, the assumption of linearity can be considered as satisfied.

additional analysis

After examing the mutiple linear regression in the core analysis, I’m interested in a more complicated model considering intercations and transformations to adjust the non-linearity and non-constant variance.


The following model considers the logarithm of average systolic blood pressure. For variables alchol and sleep, the model apply them into basis splines function with degrees of freedom 5 and order of three, with 2 knots. Interaction effect between worhrs and all other variables are considered. The estimations and significance level are reported in the following:

fit_sys = lm( log(avg_sys_bp) ~ workhrs * (age + gender + bmi + bs(alchol, 5) + 
                                             bs(sleep, 5) + smoke) + smoke:alchol,
              data = dt)

summary(fit_sys)
## 
## Call:
## lm(formula = log(avg_sys_bp) ~ workhrs * (age + gender + bmi + 
##     bs(alchol, 5) + bs(sleep, 5) + smoke) + smoke:alchol, data = dt)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.38186 -0.07910 -0.00831  0.06629  0.43370 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             4.4731977  0.1079382  41.442  < 2e-16 ***
## workhrs                -0.1317869  0.2596297  -0.508 0.612277    
## age                     0.0041774  0.0011083   3.769 0.000213 ***
## gender                 -0.0439523  0.0252797  -1.739 0.083585 .  
## bmi                     0.0054177  0.0017799   3.044 0.002639 ** 
## bs(alchol, 5)1          0.1191565  0.0617548   1.930 0.055035 .  
## bs(alchol, 5)2          0.0299150  0.0536695   0.557 0.577861    
## bs(alchol, 5)3          0.2629379  0.1224777   2.147 0.032970 *  
## bs(alchol, 5)4          0.0177825  0.1344588   0.132 0.894913    
## bs(alchol, 5)5          0.0317093  0.0789881   0.401 0.688508    
## bs(sleep, 5)1          -0.1787615  0.0981858  -1.821 0.070105 .  
## bs(sleep, 5)2           0.1319643  0.0798247   1.653 0.099811 .  
## bs(sleep, 5)3          -0.2097746  0.0955405  -2.196 0.029227 *  
## bs(sleep, 5)4           0.2610917  0.1322199   1.975 0.049634 *  
## bs(sleep, 5)5          -0.2125092  0.1311169  -1.621 0.106591    
## smoke                   0.0260160  0.0275418   0.945 0.345964    
## workhrs:age            -0.0018738  0.0020102  -0.932 0.352342    
## workhrs:gender          0.0469979  0.0421262   1.116 0.265867    
## workhrs:bmi             0.0009652  0.0031768   0.304 0.761566    
## workhrs:bs(alchol, 5)1 -0.0850794  0.1149464  -0.740 0.460038    
## workhrs:bs(alchol, 5)2 -0.1035043  0.0948057  -1.092 0.276211    
## workhrs:bs(alchol, 5)3 -0.1711782  0.2228871  -0.768 0.443360    
## workhrs:bs(alchol, 5)4 -0.2518054  0.2453225  -1.026 0.305889    
## workhrs:bs(alchol, 5)5  0.0244100  0.1585296   0.154 0.877777    
## workhrs:bs(sleep, 5)1   0.4848052  0.2807767   1.727 0.085721 .  
## workhrs:bs(sleep, 5)2   0.0927614  0.2088909   0.444 0.657459    
## workhrs:bs(sleep, 5)3   0.6274614  0.2494358   2.516 0.012645 *  
## workhrs:bs(sleep, 5)4  -0.2045744  0.3723757  -0.549 0.583339    
## workhrs:bs(sleep, 5)5   1.3149068  0.6421546   2.048 0.041857 *  
## workhrs:smoke          -0.1009870  0.0415018  -2.433 0.015810 *  
## smoke:alchol            0.0085202  0.0106651   0.799 0.425269    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1335 on 207 degrees of freedom
## Multiple R-squared:  0.2317, Adjusted R-squared:  0.1204 
## F-statistic: 2.081 on 30 and 207 DF,  p-value: 0.00154
AIC(fit_sys)
## [1] -252.4279

The model has R-squared 0.2317. The F-test against the null model calcutes the statistic of 2.081 with 30 and 207 degrees of freedom. The p-value is 0.00154, which suggests the statistical significance of this model in 0.05 confidence level. The AIC of this model is -252.428.

vif(fit_sys)
##                             GVIF Df GVIF^(1/(2*Df))
## workhrs               200.980384  1       14.176755
## age                     1.662069  1        1.289213
## gender                  2.074395  1        1.440276
## bmi                     1.823955  1        1.350539
## bs(alchol, 5)          25.003111  5        1.379747
## bs(sleep, 5)            6.382317  5        1.203643
## smoke                   2.421995  1        1.556276
## workhrs:age            25.662396  1        5.065807
## workhrs:gender          2.537148  1        1.592843
## workhrs:bmi            32.073578  1        5.663354
## workhrs:bs(alchol, 5) 223.496453  5        1.717620
## workhrs:bs(sleep, 5)  584.091711  5        1.890811
## workhrs:smoke           2.163470  1        1.470874
## smoke:alchol            4.161601  1        2.040000

When checking the variance inflation factor, I find that workhrs, workhrs:age, and workhrs:bmi are greatly inflated (more than 5). It suggests that there might be significant effect of collinearity, which may unstable the coefficients estimations.

par( mfrow = c(2,2) )
plot(Effect("age", fit_sys, partial.residuals = TRUE))

plot(Effect("bmi", fit_sys, partial.residuals = TRUE))

plot(Effect("alchol", fit_sys, partial.residuals = TRUE))

plot(Effect("sleep", fit_sys, partial.residuals = TRUE))

Above plots examine the effect of each continuous variables in the model on the response variable. The blue lines shows the relationship from the model. The red dots are partial residuals and the red line shows how each variable correlates to the response variable after controlling all other variables. We see that our model dose quite well on explaining the correlation for age, bmi and alchol. The basis spline function of Sleep, as we can see, become very unstable at the boundary points. This might suggests extrapolation bias.


The coefficients of this model is very hard to interpret, even though the model roughly satisfies the linear modeling assumptions. However, the model is still useful to explain several research questions. In the following example, I examine if sleep for 5 hours would cause significant difference on blood pressure than sleep for 8 hours if they both work overtimes.


To answer the question, if we directly use the model, we need to consider interaction terms on variables interested, and coefficients of basis splines transformation on sleep are more meaningful in mathematic rather than practical setting.


The code below create two fake datasets accroding to the original dataset. One with all sleep equals to 8 while the other equal to 5. Both dataset has workhrs as 1. After combining the original dataset with two fake datasets, we assign weights 1 to all original observations and zero to all fake observations. The purpose of assigning weights is to get the same coefficients with the previously fitted model and the same structure of the design matrix (same interactions and transformation, but with fake data). Above steps help us to get the contrast value by applying the model coefficients to our data generated by test assumptions.

# would difference between work overtimes but sleep for 8 hours and work overtimes 
# while sleep for 5 hours less significant?
dt = dt[, wgt := 1]

d_fake_8hrs = copy(dt)
d_fake_8hrs[, `:=` (workhrs = 1, sleep = 8, wgt = 0)] # make fake data with contrast

d_fake_5hrs = copy(dt)
d_fake_5hrs[, `:=` (workhrs = 1, sleep = 5, wgt = 0)] # make fake data with contrast

dx = rbindlist( list(dt, d_fake_8hrs, d_fake_5hrs) ) # combine tru data with fake data

result = lm( log(avg_sys_bp) ~ workhrs * (age + gender + bmi + bs(alchol, 5) + 
                                             bs(sleep, 5) + smoke) + smoke:alchol,
             weights = wgt, data = dx) # fit regression with zero weights on fake data

pa = coef(result) # parameters
cm = vcov(result) # covariance matrix
dm = model.matrix(result) # design matrix
nx = 238

ct = dm[ (nx+1):(nx+2*nx), ]
ct = ct[1:nx,] - ct[(nx+1):(2*nx), ] # get the contrast

znum = ct %*% pa # numerator of z-score
zdenom = sqrt(diag(ct %*% cm %*% t(ct))) # denominator of z-score
zscores = znum / zdenom

z_criticle = qnorm(0.975)

{mean(zscores) > z_criticle}
## [1] FALSE

The z-score is 0.864, which is lesser than 1.96. We failed to reject the null hypothesis that sleep 5 hours would cause difference in blood pressur than sleep 8 hours if both people work overtimes.

GLS and R_dplyr



After data cleaning, we visualized our data with the matrix plots first. As shown in the matrix plots for the systolic blood pressure, we found that there were weak correlations between the variables. As for diastolic blood pressure, the matrix plots show that there are weak relations or almost no relations between the variables. Moreover, from the histograms, we found that some variables were not normally distributed; however, most variables, including systolic and diastolic blood pressure, were normally distributed.


#test the assumptions for the linear regression
#assumption1) normality assumption: check if the continuous variables are normal
par( mfrow=c(3,3) )
#DVs
hist(df$SBP)
hist(df$DBP)
#IVs
hist(df$age)
hist(df$bmi)
hist(df$sleep)
hist(df$alcohol_adj)

#check the distributions
hist(df$overtime)
hist(df$SBP_bi)
hist(df$DBP_bi)

#for SPB
pairs(SBP ~ overtime + gender + age + bmi +
        alcohol_adj + sleep + smoke, data = df)

#for DBP
pairs(DBP ~ overtime + gender + age + bmi + 
        alcohol_adj + sleep + smoke, data = df)

We wanted to check whether the following assumptions of the linear regression model are held:


1. Homoscedasticity: The variance of the residual is constant. 2. Linearity: The relationship between the independent and dependent variables is linear. 3. No or little multicollinearity: There is no or little collinearity between independent variables.

1. Linearity
In order to check the linearity assumption, we plotted partial regression plots for both systolic blood pressure and diastolic pressure against each of the independent variables.

#assumption3) linearity
#first fit a linear regression with all covariates
avplot_SBP = lm(SBP ~ overtime + gender + age + bmi + 
                  alcohol_adj + sleep + smoke, data = df)
avplot_DBP = lm(DBP ~ overtime + gender + age + bmi + 
                  alcohol_adj + sleep + smoke, data = df)
#then see the added variable plots
car::avPlots(avplot_SBP) #for SBP

car::avPlots(avplot_DBP) #for DBP


From the partial regression plots of the systolic blood pressure, we could find that for each of the independent variables, the expected value of the dependent variable (systolic blood pressure) was indeed a linear function of the independent variable, controlled by all the other variables. We can get the same conclusion from the partial regression plots of diastolic blood pressure. Therefore, the assumption of linearity can be considered as satisfied.

2. Homoscedasticity
In order to check the homoscedasticity assumption, we plotted standardized residuals against fitted values plots for both systolic blood pressure and diastolic blood pressure.

#standardized residuals plot
#SBP
SBP_red = rstandard(avplot_SBP)
fitted_SBP = avplot_SBP$fitted.values

plot(fitted_SBP, SBP_red, xlab = "Fitted Values", 
     ylab = "Standardized Residuals", 
     main = "Standardized Residuals Plot (SBP)")

#DBP
DBP_red = rstandard(avplot_DBP)
fitted_DBP = avplot_DBP$fitted.values

plot(fitted_DBP, DBP_red, xlab = "Fitted Values", 
     ylab = "Standardized Residuals", 
     main = "Standardized Residuals Plot (DBP)")


From the standardized residuals plot of the systolic blood pressure, we could find the mean of the residual was roughly 0, and the variance was roughly constant. Also, we had the same findings for diastolic blood pressure. Therefore, the assumption of homoscedasticity could be considered as satisfied.

3. No or little multicollinearity
In order to check the multicollinearity, we computed the Pearson correlations between each two continuous independent variables and conducted Chi-squared correlation test between each two binary independent variables.

#collinearity for continuous variables
cont_df = df %>% 
  select(age, bmi, alcohol_adj, sleep)
cor(cont_df) 
##                     age         bmi alcohol_adj       sleep
## age          1.00000000 -0.13615718  0.11946370 -0.02952827
## bmi         -0.13615718  1.00000000 -0.21568224 -0.08195345
## alcohol_adj  0.11946370 -0.21568224  1.00000000  0.04966028
## sleep       -0.02952827 -0.08195345  0.04966028  1.00000000
#correlation coefficients have small values
#collinearity for categorical variables
cate_df = df %>%
  select(gender, smoke, overtime)
chisq.test(cate_df$gender, cate_df$smoke)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  cate_df$gender and cate_df$smoke
## X-squared = 3.1535, df = 1, p-value = 0.07577
chisq.test(cate_df$gender, cate_df$overtime)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  cate_df$gender and cate_df$overtime
## X-squared = 1.1059, df = 1, p-value = 0.293
chisq.test(cate_df$smoke, cate_df$overtime)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  cate_df$smoke and cate_df$overtime
## X-squared = 2.9287, df = 1, p-value = 0.08702
#no significant associations


As shown in the correlation table between each two continuous variables, we could find that there are very small or almost no correlations between the continuous independent variables. Also, as the table of the Pearson chi-squared correlation test results showed, we could find that there is little collinearity between each pair of the binary variables.

Here, we fit our linear regression models.


For the systolic blood pressure, we fit the model:
\[ \begin{aligned} y &= \beta_{0} + \beta_{workhrs}*x_{workhrs} + \beta_{gender}*x_{gender} + \beta_{age}*x_{age} + \beta_{bmi}*x_{bmi} \notag\\ &+ \beta_{sleep}*x_{sleep} + \beta_{smoke}*x_{smoke} + \beta_{avg\_alcohol\_freq\_wk}*x_{avg\_alcohol\_freq\_wk} + \epsilon \end{aligned} \]
And for diatolic blood pressure, we fit the model: \[ \begin{aligned} y &= \beta_{0} + \beta_{workhrs}*x_{workhrs} + \beta_{gender}*x_{gender} + \beta_{age}*x_{age} + \beta_{bmi}*x_{bmi} \notag\\ &+ \beta_{sleep}*x_{sleep} + \beta_{smoke}*x_{smoke} + \beta_{avg\_alcohol\_freq\_wk}*x_{avg\_alcohol\_freq\_wk} + \epsilon \end{aligned} \]

#core analysis ----------------------------------------------
#linear regression with "overtime" as the main predictor
model_SBP = gls(SBP ~ overtime + gender + age + bmi + 
                  alcohol_adj + sleep + smoke, data = df)
summary(model_SBP)
## Generalized least squares fit by REML
##   Model: SBP ~ overtime + gender + age + bmi + alcohol_adj + sleep + smoke 
##   Data: df 
##        AIC      BIC    logLik
##   2079.396 2110.338 -1030.698
## 
## Coefficients:
##                   Value Std.Error   t-value p-value
## (Intercept)    91.48710 11.615224  7.876482  0.0000
## overtime        1.57529  2.676871  0.588481  0.5568
## genderfemale   -3.62499  2.743774 -1.321168  0.1878
## age             0.47042  0.127233  3.697308  0.0003
## bmi             0.64715  0.202476  3.196195  0.0016
## alcohol_adj     0.01917  0.719598  0.026644  0.9788
## sleep           0.49186  0.980048  0.501874  0.6162
## smokeNotSmoked -1.19423  2.650964 -0.450489  0.6528
## 
##  Correlation: 
##                (Intr) overtm gndrfm age    bmi    alchl_ sleep 
## overtime        0.005                                          
## genderfemale    0.102  0.107                                   
## age            -0.544 -0.071 -0.063                            
## bmi            -0.652 -0.040 -0.238  0.137                     
## alcohol_adj    -0.101  0.029  0.239 -0.130  0.113              
## sleep          -0.637 -0.042 -0.104  0.058  0.103 -0.068       
## smokeNotSmoked -0.033 -0.113 -0.059 -0.123 -0.055  0.180 -0.031
## 
## Standardized residuals:
##        Min         Q1        Med         Q3        Max 
## -2.2317121 -0.6388468 -0.1245573  0.5466566  3.9927520 
## 
## Residual standard error: 19.20794 
## Degrees of freedom: 238 total; 230 residual
model_DBP = gls(DBP ~ overtime + gender + age + bmi + 
                  alcohol_adj + sleep + smoke, data = df)
summary(model_DBP)
## Generalized least squares fit by REML
##   Model: DBP ~ overtime + gender + age + bmi + alcohol_adj + sleep + smoke 
##   Data: df 
##        AIC      BIC    logLik
##   1919.121 1950.063 -950.5603
## 
## Coefficients:
##                   Value Std.Error   t-value p-value
## (Intercept)    49.75757  8.198020  6.069462  0.0000
## overtime        0.22461  1.889335  0.118884  0.9055
## genderfemale   -4.50410  1.936555 -2.325832  0.0209
## age             0.27600  0.089801  3.073463  0.0024
## bmi             0.53780  0.142907  3.763260  0.0002
## alcohol_adj     0.09092  0.507892  0.179023  0.8581
## sleep           0.33936  0.691717  0.490607  0.6242
## smokeNotSmoked -0.34427  1.871050 -0.183997  0.8542
## 
##  Correlation: 
##                (Intr) overtm gndrfm age    bmi    alchl_ sleep 
## overtime        0.005                                          
## genderfemale    0.102  0.107                                   
## age            -0.544 -0.071 -0.063                            
## bmi            -0.652 -0.040 -0.238  0.137                     
## alcohol_adj    -0.101  0.029  0.239 -0.130  0.113              
## sleep          -0.637 -0.042 -0.104  0.058  0.103 -0.068       
## smokeNotSmoked -0.033 -0.113 -0.059 -0.123 -0.055  0.180 -0.031
## 
## Standardized residuals:
##         Min          Q1         Med          Q3         Max 
## -3.43931088 -0.53464833  0.06774935  0.60387772  2.87705186 
## 
## Residual standard error: 13.55695 
## Degrees of freedom: 238 total; 230 residual


From the summary results of the linear regression model for the systolic blood pressure, we found the coefficient of the working hours is positive, which implies that working overtime increases systolic blood pressure level; however, this effect is not significant (\(\beta\) = 1.58, p = 0.588). As a result, we can conclude that working overtime is not significantly correlated to systolic blood pressure. Similarly, from the summary results of the linear regression model for diastolic blood pressure, we found that the coefficient of the working hours is positive, which implies that working overtime increases diastolic blood pressure level; however, this effect was not significant as well (\(\beta\) = 0.22, p = 0.905). As a result, we conclude that working overtime is not significantly correlated to diastolic blood pressure.

Additional Analysis (Hyesue Jang)


Here, I fit logistic regression (as additional analysis) to examine the association between the predictors and the risk of having the abnormal blood pressure level. Following the guide from the American Heart Association, the abnormal systolic and diastolic blood pressure was coded as 1 if the respondent had systolic blood pressure higher than 140mm Hg or diastolic blood pressure higher than 90mm Hg. If not, they were coded 0 to indicate normal blood pressure level.

#logistic regression with normal vs. abnormal BP outcome
logistic_SBP <- glm(SBP_bi ~ overtime + gender + age + bmi + 
                      alcohol_adj + sleep + smoke,
                    family = binomial(link = 'logit'),
                    data = df)
summary(logistic_SBP)
## 
## Call:
## glm(formula = SBP_bi ~ overtime + gender + age + bmi + alcohol_adj + 
##     sleep + smoke, family = binomial(link = "logit"), data = df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4280  -0.9191  -0.6722   1.1867   2.0208  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -6.16441    1.47603  -4.176 2.96e-05 ***
## overtime        0.23889    0.30418   0.785 0.432245    
## genderfemale   -0.18500    0.31844  -0.581 0.561264    
## age             0.05307    0.01587   3.343 0.000828 ***
## bmi             0.06839    0.02381   2.872 0.004075 ** 
## alcohol_adj     0.06821    0.08061   0.846 0.397494    
## sleep           0.12490    0.11583   1.078 0.280894    
## smokeNotSmoked -0.08405    0.30729  -0.274 0.784450    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 301.1  on 237  degrees of freedom
## Residual deviance: 280.7  on 230  degrees of freedom
## AIC: 296.7
## 
## Number of Fisher Scoring iterations: 4
logistic_DBP <- glm(DBP_bi ~ overtime + gender + age + bmi + 
                      alcohol_adj + sleep + smoke,
                    family = binomial(link = 'logit'),
                    data = df)
summary(logistic_DBP)
## 
## Call:
## glm(formula = DBP_bi ~ overtime + gender + age + bmi + alcohol_adj + 
##     sleep + smoke, family = binomial(link = "logit"), data = df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0715  -0.6532  -0.4741  -0.3401   2.3798  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -6.74815    1.83339  -3.681 0.000233 ***
## overtime        0.41364    0.37478   1.104 0.269727    
## genderfemale   -0.62077    0.41894  -1.482 0.138402    
## age             0.05215    0.02025   2.576 0.010003 *  
## bmi             0.08619    0.02941   2.931 0.003380 ** 
## alcohol_adj     0.11436    0.09444   1.211 0.225925    
## sleep           0.01283    0.14374   0.089 0.928902    
## smokeNotSmoked -0.28536    0.38274  -0.746 0.455936    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 215.54  on 237  degrees of freedom
## Residual deviance: 197.34  on 230  degrees of freedom
## AIC: 213.34
## 
## Number of Fisher Scoring iterations: 5



For systolic blood pressure, the logistic regression model indicated that working overtime is related to increased risk of having abnomral systolic blood pressure (\(\beta\) = 0.24, p = 0.432); however, this association was not statistically significant. Among the covariate, higher age and higher bmi were significantly associated with increased risk of having abnormal systolic blood pressure.
For diastolic blood pressure, the logistic regression model showed that working overtime is related to increased risk of having abnomral diastolic blood pressure (\(\beta\) = 0.41, p = 0.269); however, this association was not statistically significant. Among the covariate, higher age and higher bmi were significantly associated with increased risk of having abnormal diastolic blood pressure.

Python

STATS506_Project_Core_Additional_Analysis
In [28]:
import pandas as pd
import numpy as np
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.graphics.regressionplots import plot_partregress_grid
import matplotlib.pyplot as plt
%matplotlib inline

Core Analysis

After data cleaning, we visualized our data with the matrix plots first.

systolic blood pressure

In [9]:
pair_plot_data_sys_bp = full_dt[['avg_BPXSY', 'gender', 'age', 'bmi', 'sleep', 
                                 'smoke', 'workhrs', 'avg_alcohol_freq_wk']]
ax = pd.plotting.scatter_matrix(pair_plot_data_sys_bp, alpha = 0.5, figsize = (14, 12)); 
plt.xticks(fontsize = 2)
plt.yticks(fontsize = 2);

diatolic blood pressure

In [10]:
pair_plot_data_dia_bp = full_dt[['avg_BPXDI', 'gender', 'age', 'bmi', 'sleep', 
                                 'smoke', 'workhrs', 'avg_alcohol_freq_wk']]
pd.plotting.scatter_matrix(pair_plot_data_dia_bp, alpha = 0.5, figsize = (14, 12)) 
plt.xticks(fontsize = 2)
plt.yticks(fontsize = 2);

As shown in the matrix plots for the systolic blood pressure, we found that there were weak correlations between the variables. As for diastolic blood pressure, the matrix plots show that there are weak relations or almost no relations between the variables. Moreover, from the histograms, we found that some variables were not normally distributed; however, most variables, including systolic and diastolic blood pressure, were normally distributed.

After the preliminary analysis on the data, we fitted our linear regression models.

systolic blood pressure

For the systolic blood pressure, we fit the model: $$ \begin{aligned} y &= \beta_{0} + \beta_{workhrs}*x_{workhrs} + \beta_{gender}*x_{gender} + \beta_{age}*x_{age} + \beta_{bmi}*x_{bmi} \notag\\ &+ \beta_{sleep}*x_{sleep} + \beta_{smoke}*x_{smoke} + \beta_{avg\_alcohol\_freq\_wk}*x_{avg\_alcohol\_freq\_wk} + \epsilon \end{aligned} $$

In [11]:
model_formula_sys_bp = 'avg_BPXSY ~ gender + age + bmi + sleep + smoke + workhrs + avg_alcohol_freq_wk'
model_sys_bp = smf.ols(formula = model_formula_sys_bp, data = full_dt)
model_fit_sys_bp = model_sys_bp.fit()
In [12]:
print(model_fit_sys_bp.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:              avg_BPXSY   R-squared:                       0.089
Model:                            OLS   Adj. R-squared:                  0.061
Method:                 Least Squares   F-statistic:                     3.192
Date:                Mon, 09 Dec 2019   Prob (F-statistic):            0.00303
Time:                        18:28:00   Log-Likelihood:                -1037.0
No. Observations:                 238   AIC:                             2090.
Df Residuals:                     230   BIC:                             2118.
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
=======================================================================================
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
Intercept              90.2929     11.828      7.634      0.000      66.988     113.598
gender                 -3.6250      2.744     -1.321      0.188      -9.031       1.781
age                     0.4704      0.127      3.697      0.000       0.220       0.721
bmi                     0.6472      0.202      3.196      0.002       0.248       1.046
sleep                   0.4919      0.980      0.502      0.616      -1.439       2.423
smoke                   1.1942      2.651      0.450      0.653      -4.029       6.418
workhrs                 1.5753      2.677      0.588      0.557      -3.699       6.850
avg_alcohol_freq_wk     0.0192      0.720      0.027      0.979      -1.399       1.437
==============================================================================
Omnibus:                       35.325   Durbin-Watson:                   1.945
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               52.892
Skew:                           0.877   Prob(JB):                     3.27e-12
Kurtosis:                       4.502   Cond. No.                         522.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

diatolic blood pressure

For the systolic blood pressure, we fitted the following model: $$ \begin{aligned} y &= \beta_{0} + \beta_{workhrs}*x_{workhrs} + \beta_{gender}*x_{gender} + \beta_{age}*x_{age} + \beta_{bmi}*x_{bmi} \notag\\ &+ \beta_{sleep}*x_{sleep} + \beta_{smoke}*x_{smoke} + \beta_{avg\_alcohol\_freq\_wk}*x_{avg\_alcohol\_freq\_wk} + \epsilon \end{aligned} $$

In [13]:
model_formula_dia_bp = 'avg_BPXDI ~ gender + age + bmi + sleep + smoke + workhrs + avg_alcohol_freq_wk'
model_dia_bp = smf.ols(formula = model_formula_dia_bp, data = full_dt)
model_fit_dia_bp = model_dia_bp.fit()
In [14]:
print(model_fit_dia_bp.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:              avg_BPXDI   R-squared:                       0.092
Model:                            OLS   Adj. R-squared:                  0.065
Method:                 Least Squares   F-statistic:                     3.347
Date:                Mon, 09 Dec 2019   Prob (F-statistic):            0.00204
Time:                        18:28:00   Log-Likelihood:                -954.08
No. Observations:                 238   AIC:                             1924.
Df Residuals:                     230   BIC:                             1952.
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
=======================================================================================
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
Intercept              49.4133      8.348      5.919      0.000      32.965      65.862
gender                 -4.5041      1.937     -2.326      0.021      -8.320      -0.688
age                     0.2760      0.090      3.073      0.002       0.099       0.453
bmi                     0.5378      0.143      3.763      0.000       0.256       0.819
sleep                   0.3394      0.692      0.491      0.624      -1.024       1.702
smoke                   0.3443      1.871      0.184      0.854      -3.342       4.031
workhrs                 0.2246      1.889      0.119      0.905      -3.498       3.947
avg_alcohol_freq_wk     0.0909      0.508      0.179      0.858      -0.910       1.092
==============================================================================
Omnibus:                       12.349   Durbin-Watson:                   2.163
Prob(Omnibus):                  0.002   Jarque-Bera (JB):               17.669
Skew:                          -0.352   Prob(JB):                     0.000146
Kurtosis:                       4.134   Cond. No.                         522.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

From the summary results of the linear regression model for the systolic blood pressure, we found the coefficient of the working hours is positive, which implies that working overtime increases systolic blood pressure level; however, this effect is not significant ($\beta = 1.58$, p = 0.557). As a result, we can conclude that working overtime is not significantly correlated to systolic blood pressure. Similarly, from the summary results of the linear regression model for diastolic blood pressure, we found that the coefficient of the working hours is positive, which implies that working overtime increases diastolic blood pressure level; however, this effect was not significant as well ($\beta = 0.2246$, p = 0.905). As a result, we conclude that working overtime is not significantly correlated to diastolic blood pressure.

After fitting the linear regression model, we checked whether the following assumptions of the linear regression model are held:

  1. Homoscedasticity: The variance of the residual is constant.
  2. Linearity: The relationship between the independent and dependent variables is linear.
  3. No or little multicollinearity: There is no or little collinearity between independent variables.

In order to check the homoscedasticity assumption, we plotted standardized residuals against fitted values plots for both systolic blood pressure and diastolic blood pressure.

systolic blood pressure

In [15]:
# fitted values (need a constant term for intercept)
model_fitted_y_sys_bp = model_fit_sys_bp.fittedvalues

# model residuals
model_resid_sys_bp = model_fit_sys_bp.resid

# standardized residuals
model_std_resid_sys_bp = model_fit_sys_bp.get_influence().resid_studentized_internal
In [16]:
fig, ax = plt.subplots()
sns.residplot(model_fitted_y_sys_bp, model_std_resid_sys_bp, lowess = True,
              scatter_kws = {'alpha': 0.5}, 
              line_kws = {'color': 'red', 'lw': 1, 'alpha': 0.8})
plt.title('standardized residual plot (systolic blood pressure)')
ax.set_xlabel('fitted values')
ax.set_ylabel('standardized residuals')
Out[16]:
<matplotlib.text.Text at 0x112b84550>

diatolic blood pressure

In [17]:
# fitted values (need a constant term for intercept)
model_fitted_y_dia_bp = model_fit_dia_bp.fittedvalues

# model residuals
model_resid_dia_bp = model_fit_dia_bp.resid

# standardized residuals
model_std_resid_dia_bp = model_fit_dia_bp.get_influence().resid_studentized_internal

# sqrt absolute standardized residuals
model_abs_sqrt_std_resid_dia_bp = np.sqrt(np.abs(model_std_resid_dia_bp))

# absolute residuals
model_abs_resid_dia_bp = np.abs(model_resid_dia_bp)
In [18]:
fig, ax = plt.subplots()
sns.residplot(model_fitted_y_dia_bp, model_std_resid_dia_bp, lowess = True,
              scatter_kws = {'alpha': 0.5}, 
              line_kws = {'color': 'red', 'lw': 1, 'alpha': 0.8})
plt.title('standardized residual plot (diatolic blood pressure)')
ax.set_xlabel('fitted values')
ax.set_ylabel('standardized residuals')
Out[18]:
<matplotlib.text.Text at 0x1128f4668>

From the standardized residuals plot of the systolic blood pressure, we could find the mean of the residual was roughly 0, and the variance was roughly constant. Also, we had the same findings for diastolic blood pressure. Therefore, the assumption of homoscedasticity could be considered as satisfied.

In order to check the linearity assumption, we plotted partial regression plots for both systolic blood pressure and diastolic pressure against each of the independent variables.

systolic blood pressure

In [19]:
fig = plt.figure(figsize=(8, 12))
plot_partregress_grid(model_fit_sys_bp, fig=fig)
plt.show()

diatolic blood pressure

In [20]:
fig = plt.figure(figsize=(8, 12))
plot_partregress_grid(model_fit_dia_bp, fig=fig)
plt.show()

From the partial regression plots of the systolic blood pressure, we could find that for each of the independent variables, the expected value of the dependent variable (systolic blood pressure) was indeed a linear function of the independent variable, controlled by all the other variables. We can get the same conclusion from the partial regression plots of diastolic blood pressure. Therefore, the assumption of linearity can be considered as satisfied.

In order to check the multicollinearity, we computed the Pearson correlations between each two continuous independent variables and the Pearson correlations between each two binary independent variables.

check correlations between the continuous variables

In [21]:
continuous_vars = full_dt[['age', 'bmi', 'sleep', 'avg_alcohol_freq_wk']]
corr_continuous = np.corrcoef(continuous_vars.values.T)
# corr_continuous
In [22]:
fig, ax = plt.subplots(figsize=(12, 10))
sns.heatmap(corr_continuous, annot=True, square=True, annot_kws = {"size": 12})
plt.title("correlations between continuous variables", fontsize = 16)
ax.set_xticklabels(continuous_vars)
ax.set_yticklabels(continuous_vars, va = 'center')
ax.tick_params(axis='x', labelsize=12)
ax.tick_params(axis='y', labelsize=12)

check correlations between categorical variables

In [23]:
categorical_vars = full_dt[['gender', 'smoke', 'workhrs']]
categorical_vars.apply(lambda x : pd.factorize(x)[0]).corr(method='pearson')
Out[23]:
gender smoke workhrs
gender 1.000000 0.123827 -0.077188
smoke 0.123827 1.000000 0.120028
workhrs -0.077188 0.120028 1.000000

As shown in the heatmap of the correlations between the continuous independent variables, we could find that there are very small or almost no correlations between the continuous independent variables. Also, as the table of the Pearson chi-squared correlation test results showed, we could find that there is little collinearity between each pair of the binary variables.

Additional Analysis

Here, I fit the logistic regression (as additional analysis) to examine the association between the predictors and the risk of having the abnormal blood pressure level. Following the guide from the American Heart Association, the abnormal systolic and diastolic blood pressure were coded as 1 if the respondent had systolic blood pressure higher than 140mm Hg or diastolic blood pressure higher than 90mm Hg. If not, they were coded 0 to indicate normal blood pressure level.

In [24]:
# convert the variables systolic blood pressure and diatolic blood pressure into binary variables
full_dt['bin_avg_BPXSY'] = np.where(full_dt.avg_BPXSY >= 140, 1, 0)
full_dt['bin_avg_BPXDI'] = np.where(full_dt.avg_BPXDI >= 90, 1, 0)
In [25]:
# drop the variables avg_BPXSY and avg_BPXDI which are not needed for the logistic models
add_analysis_dt = full_dt.drop(['id', 'race', 'avg_BPXSY', 'avg_BPXDI'], axis = 1)
add_analysis_dt.head()
Out[25]:
gender age bmi sleep smoke workhrs avg_alcohol_freq_wk bin_avg_BPXSY bin_avg_BPXDI
0 0 46.0 27.6 10.0 1 0 6.000000 1 1
1 1 54.0 38.8 8.0 0 0 0.920598 0 0
2 1 45.0 33.2 8.0 0 1 6.000000 0 1
3 0 53.0 23.3 7.0 1 0 2.000000 1 0
4 0 52.0 27.2 8.0 0 0 1.000000 1 1

systolic blood pressure

In [26]:
endog_sys_bp = add_analysis_dt.bin_avg_BPXSY
exog_sys_bp = sm.add_constant(add_analysis_dt.iloc[:, 0:7])
logit_model_sys_bp = sm.Logit(endog_sys_bp, exog_sys_bp)
logit_model_fit_sys_bp = logit_model_sys_bp.fit()
print(logit_model_fit_sys_bp.summary())
Optimization terminated successfully.
         Current function value: 0.589699
         Iterations 6
                           Logit Regression Results                           
==============================================================================
Dep. Variable:          bin_avg_BPXSY   No. Observations:                  238
Model:                          Logit   Df Residuals:                      230
Method:                           MLE   Df Model:                            7
Date:                Mon, 09 Dec 2019   Pseudo R-squ.:                 0.06776
Time:                        18:28:04   Log-Likelihood:                -140.35
converged:                       True   LL-Null:                       -150.55
                                        LLR p-value:                  0.004764
=======================================================================================
                          coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------------
const                  -6.2485      1.500     -4.165      0.000      -9.189      -3.308
gender                 -0.1850      0.318     -0.581      0.561      -0.809       0.439
age                     0.0531      0.016      3.343      0.001       0.022       0.084
bmi                     0.0684      0.024      2.872      0.004       0.022       0.115
sleep                   0.1249      0.116      1.078      0.281      -0.102       0.352
smoke                   0.0841      0.307      0.274      0.784      -0.518       0.686
workhrs                 0.2389      0.304      0.785      0.432      -0.357       0.835
avg_alcohol_freq_wk     0.0682      0.081      0.846      0.397      -0.090       0.226
=======================================================================================

For the systolic blood pressure, the logistic regression model indicated that working overtime is related to increased risk of having abnomral systolic blood pressure ($\beta = 0.24$, p = 0.432); however, this association was not statistically significant. Among the covariates, higher age and higher bmi were significantly associated with increased risk of having abnormal systolic blood pressure.

diatolic blood pressure

In [27]:
endog_dia_bp = add_analysis_dt.bin_avg_BPXDI
exog_dia_bp = sm.add_constant(add_analysis_dt.iloc[:, 0:7])
logit_model_dia_bp = sm.Logit(endog_dia_bp, exog_dia_bp)
logit_model_fit_dia_bp = logit_model_dia_bp.fit()
print(logit_model_fit_dia_bp.summary())
Optimization terminated successfully.
         Current function value: 0.414587
         Iterations 6
                           Logit Regression Results                           
==============================================================================
Dep. Variable:          bin_avg_BPXDI   No. Observations:                  238
Model:                          Logit   Df Residuals:                      230
Method:                           MLE   Df Model:                            7
Date:                Mon, 09 Dec 2019   Pseudo R-squ.:                 0.08441
Time:                        18:28:04   Log-Likelihood:                -98.672
converged:                       True   LL-Null:                       -107.77
                                        LLR p-value:                   0.01113
=======================================================================================
                          coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------------
const                  -7.0335      1.873     -3.754      0.000     -10.705      -3.362
gender                 -0.6208      0.419     -1.482      0.138      -1.442       0.200
age                     0.0522      0.020      2.576      0.010       0.012       0.092
bmi                     0.0862      0.029      2.931      0.003       0.029       0.144
sleep                   0.0128      0.144      0.089      0.929      -0.269       0.295
smoke                   0.2854      0.383      0.746      0.456      -0.465       1.036
workhrs                 0.4136      0.375      1.104      0.270      -0.321       1.148
avg_alcohol_freq_wk     0.1144      0.094      1.211      0.226      -0.071       0.299
=======================================================================================

For the diastolic blood pressure, the logistic regression model showed that working overtime is related to increased risk of having abnomral diastolic blood pressure ($\beta = 0.41$, p = 0.27); however, this association was not statistically significant. Among the covariates, higher age and higher bmi were significantly associated with increased risk of having abnormal diastolic blood pressure.

In [ ]:
 

References

Hayashi, T., Kobayashi, Y., Yamaoka, K., & Yano, E. (1996). Effect of overtime work on 24-hour ambulatory blood pressure. Journal of occupational and environmental medicine, 38(10), 1007-1011.


Lauer, R. M., Mahoney, L. T., Clarke, W. R., & Witt, J. (1993). Childhood predictors for high adult blood pressure: the Muscatine Study. Pediatric Clinics of North America, 40(1), 23-40.


Nakamura, K., Sakurai, M., Morikawa, Y., Miura, K., Ishizaki, M., Kido, T., … & Nakagawa, H. (2012). Overtime work and blood pressure in normotensive Japanese male workers. American journal of hypertension, 25(9), 979-985.


Nishiyama, K., & Johnson, J. V. (1997). Karoshi–death from overwork: occupational health consequences of Japanese production management. International Journal of Health Services, 27(4), 625-641.


Papathanasiou, G., Zerva, E., Zacharis, I., Papandreou, M., Papageorgiou, E., Tzima, C., … & Evangelou, A. (2015). Association of high blood pressure with body mass index, smoking and physical activity in healthy young adults. The open cardiovascular medicine journal, 9, 5.